from IPython.display import Image
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\Title Page.png")
This notebook analyzes a marketing campaign by quantifying key metrics, visualizing the data, and running an A/B test. The dataset is a fictional one from DataCamp.
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\2.png")
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\Aim.png")
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
Image(filename=r"C:\Users\Dalkeith J Thomas\Downloads\Importing and Cleaning.png")
df=pd.read_csv(r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\marketing_new.csv")
df.shape
(10037, 16)
df.head()
| Unnamed: 0 | user_id | date_served | marketing_channel | variant | converted | language_displayed | language_preferred | age_group | date_subscribed | date_canceled | subscribing_channel | is_retained | DoW | channel_code | is_correct_lang | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | a100000029 | 2018-01-01 | House Ads | personalization | True | English | English | 0-18 years | 2018-01-01 | NaN | House Ads | True | 0.0 | 1.0 | Yes |
| 1 | 1 | a100000030 | 2018-01-01 | House Ads | personalization | True | English | English | 19-24 years | 2018-01-01 | NaN | House Ads | True | 0.0 | 1.0 | Yes |
| 2 | 2 | a100000031 | 2018-01-01 | House Ads | personalization | True | English | English | 24-30 years | 2018-01-01 | NaN | House Ads | True | 0.0 | 1.0 | Yes |
| 3 | 3 | a100000032 | 2018-01-01 | House Ads | personalization | True | English | English | 30-36 years | 2018-01-01 | NaN | House Ads | True | 0.0 | 1.0 | Yes |
| 4 | 4 | a100000033 | 2018-01-01 | House Ads | personalization | True | English | English | 36-45 years | 2018-01-01 | NaN | House Ads | True | 0.0 | 1.0 | Yes |
The dataset has several columns, most of which appear to be categorical. Some columns will not be used in the analysis, namely the unnamed index column, is_correct_lang, and channel_code, so I will exclude them.
df=df.drop(columns=['Unnamed: 0','is_correct_lang','channel_code'])
df.head()
| user_id | date_served | marketing_channel | variant | converted | language_displayed | language_preferred | age_group | date_subscribed | date_canceled | subscribing_channel | is_retained | DoW | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a100000029 | 2018-01-01 | House Ads | personalization | True | English | English | 0-18 years | 2018-01-01 | NaN | House Ads | True | 0.0 |
| 1 | a100000030 | 2018-01-01 | House Ads | personalization | True | English | English | 19-24 years | 2018-01-01 | NaN | House Ads | True | 0.0 |
| 2 | a100000031 | 2018-01-01 | House Ads | personalization | True | English | English | 24-30 years | 2018-01-01 | NaN | House Ads | True | 0.0 |
| 3 | a100000032 | 2018-01-01 | House Ads | personalization | True | English | English | 30-36 years | 2018-01-01 | NaN | House Ads | True | 0.0 |
| 4 | a100000033 | 2018-01-01 | House Ads | personalization | True | English | English | 36-45 years | 2018-01-01 | NaN | House Ads | True | 0.0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10037 entries, 0 to 10036
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              10037 non-null  object
 1   date_served          10021 non-null  object
 2   marketing_channel    10022 non-null  object
 3   variant              10037 non-null  object
 4   converted            10037 non-null  bool
 5   language_displayed   10037 non-null  object
 6   language_preferred   10037 non-null  object
 7   age_group            10037 non-null  object
 8   date_subscribed      1856 non-null   object
 9   date_canceled        577 non-null    object
 10  subscribing_channel  1856 non-null   object
 11  is_retained          10037 non-null  bool
 12  DoW                  1856 non-null   float64
dtypes: bool(2), float64(1), object(10)
memory usage: 882.3+ KB
I will create a subset of the categorical variables to inspect the categories each one contains.
df_cat=df[["marketing_channel","variant","language_displayed","language_preferred","age_group","subscribing_channel"]]
for col in df_cat:
print(df[col].value_counts())
House Ads    4733
Instagram    1871
Facebook     1860
Push          993
Email         565
Name: marketing_channel, dtype: int64
control            5091
personalization    4946
Name: variant, dtype: int64
English    9793
Spanish     136
German       81
Arabic       27
Name: language_displayed, dtype: int64
English    9275
Spanish     450
German      167
Arabic      145
Name: language_preferred, dtype: int64
19-24 years    1682
24-30 years    1568
0-18 years     1539
30-36 years    1355
36-45 years    1353
45-55 years    1353
55+ years      1187
Name: age_group, dtype: int64
Instagram    600
Facebook     445
House Ads    354
Email        290
Push         167
Name: subscribing_channel, dtype: int64
df.isna().sum()
user_id                   0
date_served              16
marketing_channel        15
variant                   0
converted                 0
language_displayed        0
language_preferred        0
age_group                 0
date_subscribed        8181
date_canceled          9460
subscribing_channel    8181
is_retained               0
DoW                    8181
dtype: int64
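The scale of the missingness is easier to judge as a share of all rows. A minimal sketch, using a toy frame that mimics the real columns' missingness pattern (the actual notebook frame would simply call `df.isna().mean()` directly):

```python
import pandas as pd

# Toy stand-in for the real frame: 8181 of 10037 rows lack a subscription date.
df = pd.DataFrame({
    "date_subscribed": [None] * 8181 + ["2018-01-01"] * 1856,
    "converted": [True] * 10037,
})

# isna() gives booleans; their mean is the fraction missing per column.
missing_pct = df.isna().mean().mul(100).round(2)
print(missing_pct)
# date_subscribed is ~81.5% missing, which is why the cleaning step below
# keeps only rows where it is present.
```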
# User IDs are often repeated, and the subscription date is filled in for only one of a user's rows,
# so keep only rows with a non-null user_id and date_subscribed
df = df[df['user_id'].notnull() & df['date_subscribed'].notnull()].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
df.isna().sum()
user_id                   0
date_served               0
marketing_channel         0
variant                   0
converted                 0
language_displayed        0
language_preferred        0
age_group                 0
date_subscribed           0
date_canceled          1279
subscribing_channel       0
is_retained               0
DoW                       0
dtype: int64
I will create an is_active column derived from date_canceled: a missing cancellation date means the subscription is still active.
df['is_active'] = df['date_canceled'].isnull().astype('bool')
df.head()
| user_id | date_served | marketing_channel | variant | converted | language_displayed | language_preferred | age_group | date_subscribed | date_canceled | subscribing_channel | is_retained | DoW | is_active | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a100000029 | 2018-01-01 | House Ads | personalization | True | English | English | 0-18 years | 2018-01-01 | NaN | House Ads | True | 0.0 | True |
| 1 | a100000030 | 2018-01-01 | House Ads | personalization | True | English | English | 19-24 years | 2018-01-01 | NaN | House Ads | True | 0.0 | True |
| 2 | a100000031 | 2018-01-01 | House Ads | personalization | True | English | English | 24-30 years | 2018-01-01 | NaN | House Ads | True | 0.0 | True |
| 3 | a100000032 | 2018-01-01 | House Ads | personalization | True | English | English | 30-36 years | 2018-01-01 | NaN | House Ads | True | 0.0 | True |
| 4 | a100000033 | 2018-01-01 | House Ads | personalization | True | English | English | 36-45 years | 2018-01-01 | NaN | House Ads | True | 0.0 | True |
I will convert the date columns from object to datetime.
date_cols=['date_served','date_subscribed','date_canceled']
for var in date_cols:
    df[var]=pd.to_datetime(df[var],format='%Y-%m-%d')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1856 entries, 0 to 10036
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              1856 non-null   object
 1   date_served          1856 non-null   datetime64[ns]
 2   marketing_channel    1856 non-null   object
 3   variant              1856 non-null   object
 4   converted            1856 non-null   bool
 5   language_displayed   1856 non-null   object
 6   language_preferred   1856 non-null   object
 7   age_group            1856 non-null   object
 8   date_subscribed      1856 non-null   datetime64[ns]
 9   date_canceled        577 non-null    datetime64[ns]
 10  subscribing_channel  1856 non-null   object
 11  is_retained          1856 non-null   bool
 12  DoW                  1856 non-null   float64
 13  is_active            1856 non-null   bool
dtypes: bool(3), datetime64[ns](3), float64(1), object(7)
memory usage: 179.4+ KB
# Replace the numeric DoW column with the day name derived from date_served
df['DoW'] = df['date_served'].dt.day_name()
df_cat=df[["marketing_channel","variant","language_displayed","language_preferred","age_group","subscribing_channel","DoW"]]
df.head()
| user_id | date_served | marketing_channel | variant | converted | language_displayed | language_preferred | age_group | date_subscribed | date_canceled | subscribing_channel | is_retained | DoW | is_active | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a100000029 | 2018-01-01 | House Ads | personalization | True | English | English | 0-18 years | 2018-01-01 | NaT | House Ads | True | Monday | True |
| 1 | a100000030 | 2018-01-01 | House Ads | personalization | True | English | English | 19-24 years | 2018-01-01 | NaT | House Ads | True | Monday | True |
| 2 | a100000031 | 2018-01-01 | House Ads | personalization | True | English | English | 24-30 years | 2018-01-01 | NaT | House Ads | True | Monday | True |
| 3 | a100000032 | 2018-01-01 | House Ads | personalization | True | English | English | 30-36 years | 2018-01-01 | NaT | House Ads | True | Monday | True |
| 4 | a100000033 | 2018-01-01 | House Ads | personalization | True | English | English | 36-45 years | 2018-01-01 | NaT | House Ads | True | Monday | True |
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\Exploratory D Analysis.png")
market_channel_cat=df['marketing_channel'].value_counts()
plt.figure(figsize=(8,6))
plt.pie(market_channel_cat, labels=market_channel_cat.index, autopct='%1.1f%%', startangle=140)
plt.title('Pie Chart of Marketing Channel Distribution')
plt.tight_layout()
plt.show()
House Ads was the most common marketing channel. Social media was also used extensively in this campaign, with Facebook and Instagram combining for over 40% of the distribution.
# Create the bar plot
plt.figure(figsize=(8, 6)) # Set the figure size (width, height) in inches
sns.countplot(x='marketing_channel',data=df,palette="Set3")
# Customize the plot
plt.title('Countplot showing Marketing Channels')
plt.xlabel('Marketing Channel')
plt.ylabel('Values')
# Show the plot
plt.tight_layout()
plt.show()
for col in df_cat:
plt.figure(figsize=(10, 6))
sns.countplot(y=col, data=df, order=df[col].value_counts().index,palette='Set3')
plt.title(f'Distribution of {col}')
plt.show()
The largest age group is 19-24 year olds, while the smallest is 55+ years old. Instagram is the most common subscribing channel, and English is the most preferred language.
conversions_over_time = df.groupby(df['date_served'].dt.to_period('D'))['converted'].sum()
conversions_over_time.plot(kind='bar', figsize=(12, 6), title='Conversions Over Time')
plt.xlabel('Day')
plt.ylabel('Number of Conversions')
plt.show()
Conversions slowed as the month progressed, apart from a spike around the middle of the month. I will investigate that three-day spike to see if there are any additional insights.
spike_filtered=pd.to_datetime(['2018-01-15','2018-01-16','2018-01-17'])  # datetimes, so isin matches the converted date_served column
spike_filtered_df=df[df['date_served'].isin(spike_filtered)]
excluded_df=df[~df['date_served'].isin(spike_filtered)]
for col in df_cat:
plt.figure(figsize=(10, 6))
sns.countplot(y=col, data=spike_filtered_df, order=df[col].value_counts().index,palette='Set3')
plt.title(f'Distribution of {col} for the spike days')
plt.show()
for col in df_cat:
plt.figure(figsize=(10, 6))
sns.countplot(y=col, data=excluded_df, order=df[col].value_counts().index,palette='Set3')
plt.title(f'Distribution of {col} with exclusions')
plt.show()
These days were characterized by highly personalized emails, to which 24-30 year olds responded very favourably. Emails that address recipients by name or reflect their preferences feel more personal and relevant, and this demographic appreciates personalization, feeling that the brand understands and values them as individuals. This age group is also very comfortable with digital communication and often uses email as a primary channel for receiving information and making decisions, so they are more likely to open and act on emails they perceive as useful and engaging.
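One way to quantify the spike is to compare conversion rates inside versus outside the three-day window. A minimal sketch on toy data (column names assumed to match the notebook's: date_served, marketing_channel, converted):

```python
import pandas as pd

# Toy stand-in for the notebook frame.
df = pd.DataFrame({
    "date_served": pd.to_datetime(
        ["2018-01-10", "2018-01-15", "2018-01-16", "2018-01-17", "2018-01-20"]
    ),
    "marketing_channel": ["House Ads", "Email", "Email", "Email", "Push"],
    "converted": [False, True, True, True, False],
})

spike_days = pd.to_datetime(["2018-01-15", "2018-01-16", "2018-01-17"])
is_spike = df["date_served"].isin(spike_days)

# Conversion rate (%) inside vs outside the spike window.
rates = df.groupby(is_spike)["converted"].mean().mul(100)
print(rates)  # False -> 0.0, True -> 100.0 on this toy data
```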
cancel_counts = df['is_retained'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(cancel_counts, labels=cancel_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Pie Chart Showing the Percentage of Retained Users')
plt.legend(labels=cancel_counts.index, title="User Status", loc="center left", bbox_to_anchor=(1, 0.5))
plt.show()
convert_counts = df['converted'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(convert_counts, labels=convert_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Pie Chart Showing the Percentage of Converted Users')
plt.legend(labels=convert_counts.index, title="User Status", loc="center left", bbox_to_anchor=(1, 0.5))
plt.show()
The business retained more than two-thirds of its customers and also converted over half of them.
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\Feature Engineering.png")
Now I will compute the overall conversion and retention rates.
total_consumers=len(df)
total_converted=df[df['converted']==True].shape[0]
print(f"Total consumers: {total_consumers}, Total converted: {total_converted}")
Total consumers: 1856, Total converted: 1050
total_conversion_rate=(total_converted/total_consumers)*100
print(f"The overall conversion rate is {total_conversion_rate:.2f}%")
The overall conversion rate is 56.57%
total_retained=df[df['is_retained']==True].shape[0]
total_retention_rate=(total_retained/total_consumers)*100
print(f"The overall retention rate is {total_retention_rate:.2f}%")
The overall retention rate is 68.91%
The company performed well on conversion, with a rate above 50%; its retention rate was about average.
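Because `converted` and `is_retained` are booleans, the count-based formula used above is equivalent to taking the mean of the column directly, which is a handy sanity check:

```python
import pandas as pd

converted = pd.Series([True, True, False, True])  # toy stand-in column

# Count-based formula used in the notebook...
rate_counts = converted.sum() / len(converted) * 100
# ...is equivalent to the mean of the boolean column.
rate_mean = converted.mean() * 100

assert rate_counts == rate_mean
print(f"{rate_mean:.2f}%")  # prints "75.00%"
```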
rates = {'Conversion Rate': total_conversion_rate, 'Retention Rate': total_retention_rate}
names = list(rates.keys())
values = list(rates.values())
colors = sns.color_palette("Set3", len(names))
plt.figure(figsize=(10, 6))
bars = plt.bar(names, values, color=colors)
plt.title('Conversion and Retention Rates')
plt.ylabel('Rate(%)')
plt.ylim(0, 100)
for bar in bars.patches:
y_val=bar.get_height()
plt.text(bar.get_x()+bar.get_width()/2,y_val+0.05,f'{y_val:.2f}',ha='center',va='bottom')
plt.show()
I will create conversion and retention plotting functions to avoid retyping the same code multiple times.
def conversion(column):
column_conversion = df.groupby(column)['converted'].mean() * 100
colors = sns.color_palette('Set3')
plt.figure(figsize=(10,6))
bars = column_conversion.plot(kind='bar', color=colors, title=f'Conversion Rate by {column}')
plt.xlabel(column)
plt.ylabel('Conversion Rate (%)')
plt.xticks(rotation=45)
for bar in bars.patches:
yval = bar.get_height()
plt.text(bar.get_x() + bar.get_width()/2, yval + 0.5, f'{yval:.2f}%', ha='center', va='bottom')
plt.tight_layout()
plt.show()
def retention(column):
column_retention = df.groupby(column)['is_retained'].mean() * 100
colors = sns.color_palette('Set2')
plt.figure(figsize=(10,6))
bars = column_retention.plot(kind='bar', color=colors, title=f'Retention Rate by {column}')
plt.xlabel(column)
plt.ylabel('Retention Rate (%)')
plt.xticks(rotation=45)
for bar in bars.patches:
yval = bar.get_height()
plt.text(bar.get_x() + bar.get_width()/2, yval + 0.5, f'{yval:.2f}%', ha='center', va='bottom')
plt.tight_layout()
plt.show()
for col in df_cat:
conversion(col)
In terms of conversion rate, Email was the best-performing marketing channel, possibly due to the three-day spike in the middle of the month mentioned earlier. Personalization also had a stronger outcome than control, implying that personalization drives more conversions. Among displayed languages, German produced the highest conversion rate. The highest conversion rate fell in the 19-24 age category; the 55+ and 24-30 groups also converted at high rates. Among subscribing channels, House Ads had the highest rate, and weekends had higher conversion rates than weekdays.
for col in df_cat:
retention(col)
Across marketing channels, retention rates are fairly close, with Email slightly surpassing the others. Control retained more users than personalization, the opposite of the conversion pattern. All displayed languages had retention rates above 60%. Among age groups, 55+ year olds had the highest retention rate. House Ads had the lowest retention rate despite having the highest conversion rate, as mentioned earlier. Users who signed up on Mondays and Tuesdays were retained at higher rates than those who signed up on other days.
def heatmap(var1,var2):
heatmap_data = pd.crosstab(df[var1], df[var2])
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, fmt="d", cmap="YlGnBu")
plt.title(f'Frequency of {var1} by {var2}')
plt.xlabel(var2)
plt.ylabel(var1)
plt.show()
heatmap('age_group','marketing_channel')
The current marketing strategy may not be resonating with the 36-45 and 45-55 age groups. House Ads, as a marketing channel, seems less effective overall, especially for these age groups, suggesting a misalignment between the ads' messaging and these groups' preferences or needs. First, evaluate the content of the house ads: conduct focus groups or surveys with the 36-45 and 45-55 age groups to understand their preferences and pain points, then tailor the message to better align with their interests. Second, experiment with other marketing channels for these groups, for example social media platforms, email marketing, or partnerships with influencers who cater to these demographics.
def ratemap(val,ind,col):
heatmap_data = df.pivot_table(values=(val), index=(ind), columns=(col), aggfunc='mean')
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap="viridis")
plt.title(f'{val} Rates by {col} and {ind}')
plt.xlabel(col)
plt.ylabel(ind)
plt.show()
ratemap('converted','age_group','marketing_channel')
Though the 36-45 and 45-55 age groups have few individuals in the Email category, their conversion rates there are extremely high. To attract more converts from the 36-45 group, email-based advertising should be targeted, as it is the only channel with a favourable conversion rate (above 50%) for that group. The 45-55 group converts well with both emails and push notifications, so those channels should be targeted to increase engagement from this group. The 30-36 group would also benefit from an increase; email is best received there, so resources should be concentrated on it.
ratemap('converted','age_group','variant')
Personalization was best received by the youngest age group (those under 18). All groups responded to the personal touch: although the conversion rate decreased with age, personalization exceeded control for every group except those 55+ years old.
ratemap('converted','age_group','DoW')
Additionally, 30-45 year olds respond most favourably on Saturdays, whereas the 45-55 group responds most favourably on Sundays. To improve conversion rates for these groups, the analysis suggests offering personalized email advertisements on Saturdays for 30-45 year olds and on Sundays for 45-55 year olds.
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\AB testing.png")
For the purpose of our A/B analysis, we will define:
Null Hypothesis (H0): There is no difference in the conversion rates between the personalization and control groups.
Alternative Hypothesis (H1): There is a difference in the conversion rates between the personalization and control groups.
I will use a chi-squared test, since it handles categorical, proportional data with binary outcomes easily. I will investigate whether there is a statistically significant difference between personalization (advertisements with a personal touch) and control (advertisements left unchanged).
conversion_rates = df.groupby('variant')['converted'].mean() * 100
print(conversion_rates)
variant
control            37.870472
personalization    74.603175
Name: converted, dtype: float64
heatmap('converted','variant')
contingency_table = pd.crosstab(df['converted'], df['variant'])
chi2, p, dof, ex = chi2_contingency(contingency_table)
print(f"Chi-square: {chi2}, p-value: {p}")
#Assuming Alpha is 0.05
alpha = 0.05
if p < alpha:
print("Reject the null hypothesis: There is a significant difference between the groups.")
else:
print("Fail to reject the null hypothesis: There is no significant difference between the groups.")
Chi-square: 253.25434002346478, p-value: 5.069698946689019e-57
Reject the null hypothesis: There is a significant difference between the groups.
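As a cross-check on the chi-square result, the same comparison can be framed as a two-proportion z-test; for a 2x2 table without continuity correction, the chi-square statistic equals z squared (note that `chi2_contingency` applies Yates' correction to 2x2 tables by default, so the values will differ slightly). A hand-rolled sketch with toy counts, not the notebook's actual group sizes:

```python
from math import sqrt
from scipy.stats import norm

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # pooled success rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))           # two-sided p-value

# Toy counts (illustrative only): 110/300 control vs 220/300 personalization.
z, p = two_prop_ztest(110, 300, 220, 300)
print(f"z = {z:.2f}, p = {p:.3g}")
```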
Image(filename=r"C:\Users\Dalkeith J Thomas\OneDrive - The University of the West Indies, Mona Campus\Data Science Projects\Importing and Cleaning.png")
df['converted']=df['converted'].astype(int)
df['is_retained']=df['is_retained'].astype(int)
columns_to_drop = ['user_id', 'date_served','date_subscribed','is_active','date_canceled']
df.drop(columns=columns_to_drop, inplace=True)
encoded_df = pd.get_dummies(df, columns=["marketing_channel","variant","language_displayed","language_preferred","age_group","subscribing_channel","DoW"])
print(encoded_df.head())
   converted  is_retained  marketing_channel_Email  \
0          1            1                        0
1          1            1                        0
2          1            1                        0
3          1            1                        0
4          1            1                        0

   marketing_channel_Facebook  marketing_channel_House Ads  \
0                           0                            1
1                           0                            1
2                           0                            1
3                           0                            1
4                           0                            1

   marketing_channel_Instagram  marketing_channel_Push  variant_control  \
0                            0                       0                0
1                            0                       0                0
2                            0                       0                0
3                            0                       0                0
4                            0                       0                0

   variant_personalization  language_displayed_Arabic  ...  \
0                        1                          0  ...
1                        1                          0  ...
2                        1                          0  ...
3                        1                          0  ...
4                        1                          0  ...

   subscribing_channel_House Ads  subscribing_channel_Instagram  \
0                              1                              0
1                              1                              0
2                              1                              0
3                              1                              0
4                              1                              0

   subscribing_channel_Push  DoW_Friday  DoW_Monday  DoW_Saturday  DoW_Sunday  \
0                         0           0           1             0           0
1                         0           0           1             0           0
2                         0           0           1             0           0
3                         0           0           1             0           0
4                         0           0           1             0           0

   DoW_Thursday  DoW_Tuesday  DoW_Wednesday
0             0            0              0
1             0            0              0
2             0            0              0
3             0            0              0
4             0            0              0

[5 rows x 36 columns]
X = encoded_df.drop('converted', axis=1)
y = encoded_df['converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Note: nearly all features are 0/1 dummies, so scaling changes little;
# the classifiers below are fitted on the unscaled data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,confusion_matrix
# Define classifiers
classifiers = {
'Logistic Regression': LogisticRegression(),
'Random Forest': RandomForestClassifier(),
'Support Vector Machine': SVC(probability=True),
'K-Nearest Neighbors': KNeighborsClassifier(),
'Decision Tree': DecisionTreeClassifier(),
'Naive Bayes': GaussianNB()
}
for clf_name, clf in classifiers.items():
# Train the classifier
clf.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = clf.predict(X_test)
# Calculate accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1=f1_score(y_test,y_pred)
# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.title(f'Confusion Matrix - {clf_name}')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
# Print the scores for the current classifier
print(f'{clf_name} Metrics:')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1}')
print('\n')
Logistic Regression Metrics:
Accuracy: 0.8441
Precision: 0.8263
Recall: 0.9198
F1 Score: 0.8705357142857143

Random Forest Metrics:
Accuracy: 0.9032
Precision: 0.9151
Recall: 0.9151
F1 Score: 0.9150943396226415

Support Vector Machine Metrics:
Accuracy: 0.8898
Precision: 0.8834
Recall: 0.9292
F1 Score: 0.9057471264367817

K-Nearest Neighbors Metrics:
Accuracy: 0.8710
Precision: 0.8661
Recall: 0.9151
F1 Score: 0.8899082568807338

Decision Tree Metrics:
Accuracy: 0.8871
Precision: 0.9250
Recall: 0.8726
F1 Score: 0.8980582524271845

Naive Bayes Metrics:
Accuracy: 0.6317
Precision: 0.8319
Recall: 0.4434
F1 Score: 0.5784615384615385
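For reference, the four metrics reported above all derive from the confusion-matrix counts. A minimal sketch with toy labels (not the notebook's actual predictions):

```python
# Toy labels illustrating how the metrics relate to TP/TN/FP/FN counts.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of correct predictions
precision = tp / (tp + fp)                          # of predicted positives, how many were real
recall = tp / (tp + fn)                             # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy, precision, recall, f1)  # 0.75 0.8 0.8 ...
```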
from sklearn.metrics import roc_curve, roc_auc_score
plt.figure(figsize=(8, 6))
# Loop through classifiers and plot ROC curves
for clf_name, clf in classifiers.items():
# Make predictions on the test data to get predicted probabilities
y_pred_proba = clf.predict_proba(X_test)[:, 1]
# Calculate ROC curve and AUC score
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)
# Plot ROC curve
plt.plot(fpr, tpr, label=f'{clf_name} (AUC = {auc_score:.2f})')
# Plot the random guess line (diagonal)
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Guess')
# Set labels and title
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid(True)
plt.show()
Random Forest is the best overall classifier, with the highest accuracy and F1 score (Decision Tree edges it on precision and SVM on recall, but neither balances the two as well).
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Reduce the size of the parameter grid
param_grid = {
'n_estimators': [100, 200],
'max_depth': [None, 10],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2]
}
rf = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(rf, param_distributions=param_grid, n_iter=5, cv=3, scoring='accuracy', n_jobs=-1)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_
best_estimator = random_search.best_estimator_
y_pred = best_estimator.predict(X_test)
print(classification_report(y_test, y_pred))
rf=best_estimator
              precision    recall  f1-score   support

           0       0.89      0.89      0.89       160
           1       0.92      0.92      0.92       212

    accuracy                           0.91       372
   macro avg       0.90      0.90      0.90       372
weighted avg       0.91      0.91      0.91       372
print(rf)
RandomForestClassifier(min_samples_split=5, random_state=42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Extract feature importances
feature_importances = rf.feature_importances_
# Create a DataFrame for better visualization
feature_importances_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': feature_importances
})
# Sort the DataFrame by importance
feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)
# Visualize the feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df)
plt.title('Feature Importance in Random Forest')
plt.show()
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)
                          Feature  Importance
25  subscribing_channel_House Ads    0.153402
3     marketing_channel_House Ads    0.146371
6                 variant_control    0.089454
7         variant_personalization    0.078922
26  subscribing_channel_Instagram    0.076525
4     marketing_channel_Instagram    0.050185
24   subscribing_channel_Facebook    0.049337
2      marketing_channel_Facebook    0.036749
1         marketing_channel_Email    0.033639
23      subscribing_channel_Email    0.024058
27       subscribing_channel_Push    0.020620
32                   DoW_Thursday    0.020097
34                  DoW_Wednesday    0.018875
5          marketing_channel_Push    0.018051
0                     is_retained    0.017653
22            age_group_55+ years    0.015020
29                     DoW_Monday    0.014997
17          age_group_19-24 years    0.014192
18          age_group_24-30 years    0.011974
20          age_group_36-45 years    0.011857
21          age_group_45-55 years    0.011037
33                    DoW_Tuesday    0.010719
31                     DoW_Sunday    0.010059
28                     DoW_Friday    0.009808
16           age_group_0-18 years    0.009594
30                   DoW_Saturday    0.009521
19          age_group_30-36 years    0.008535
13     language_preferred_English    0.007139
14      language_preferred_German    0.005747
9      language_displayed_English    0.004525
12      language_preferred_Arabic    0.002944
10      language_displayed_German    0.002831
15     language_preferred_Spanish    0.002760
8       language_displayed_Arabic    0.001411
11     language_displayed_Spanish    0.001392
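One caveat when reading this ranking: one-hot encoding splits each categorical column into several dummies, so a fairer comparison sums the dummy importances back to their source column. A sketch with toy importances standing in for `rf.feature_importances_` (the prefix list and values below are illustrative):

```python
import pandas as pd

# Toy importances keyed by dummy-column name, mimicking the encoded frame.
importances = pd.Series({
    "marketing_channel_Email": 0.03,
    "marketing_channel_House Ads": 0.15,
    "variant_control": 0.09,
    "variant_personalization": 0.08,
    "is_retained": 0.02,
})

def base_feature(name):
    # Map a dummy column back to its source column (assumes "col_value" naming).
    for base in ("marketing_channel", "variant"):
        if name.startswith(base + "_"):
            return base
    return name

# Sum each source column's dummies into a single importance score.
grouped = importances.groupby(base_feature).sum().sort_values(ascending=False)
print(grouped)
```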
The most important feature in determining whether someone converted was the House Ads channel (both as subscribing channel and marketing channel), followed by the control and then the personalization variant. The least important features were the language-related ones. Thus, to improve conversion rates, the business should rely less on house advertisements and personalize its ads even more. This should lift conversion rates across the age groups (36-55) that are currently significantly lower.